Machine Learning with Spark

19. Quiz - K-means

Let's look at the distribution of the Title+Body length feature we used before. Instead of using the raw number of words, we can create categories based on this length: short, longer, …, super long.

In the questions below, I'll refer to the length of the combined Title and Body fields as Description Length. By length, we mean the number of words when the text is tokenized with pattern="\W".
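To make the definition concrete, here is a minimal plain-Python sketch of how a Description Length could be counted for one question. It assumes the tokenizer behaves like Spark's RegexTokenizer with its defaults (split on the pattern, lowercase the text, drop empty tokens) and that Title and Body are joined with a single space. The function name description_length and the sample strings are hypothetical.

```python
import re


def description_length(title: str, body: str) -> int:
    """Count word tokens in the combined Title and Body text."""
    # Assumption: Title and Body are joined with one space, then lowercased.
    text = f"{title} {body}".lower()
    # Split on every single non-word character. re.ASCII keeps \W limited to
    # anything outside [a-zA-Z0-9_], which is how Java regexes treat it by default.
    pieces = re.split(r"\W", text, flags=re.ASCII)
    # Consecutive separators yield empty strings, so drop them before counting.
    tokens = [piece for piece in pieces if piece]
    return len(tokens)


example_length = description_length(
    "How do I sort a list?", "I tried list.sort() but..."
)
print(example_length)
```

Note that punctuation inside code snippets also splits tokens, so list.sort() counts as two words here.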

How many times greater is the Description Length of the longest question than the Description Length of the shortest question (rounded to the nearest whole number)?

Tip: Don't forget to import Spark SQL's aggregate functions that can operate on DataFrame columns.

123

356

753

SOLUTION:

753

What are the mean and standard deviation of the Description Length?

170, 64

180, 192

180, 213

190, 319

SOLUTION:

180, 192

Let's use K-means to create 5 clusters of Description Lengths. Set the random seed to 42 and fit a K-means model with 5 clusters on the Description Length column (you can use KMeans().setParams(…)).
What length is the center of the cluster representing the longest questions?

180

2634

7532

SOLUTION:

2634
